Representing Semantic Relationships in Ancient IE Languages

A Pilot Study

Anton Vinogradov, Gabriel Wallace, and Andrew Byrd

University of Kentucky

2024-10-25

Overview of Talk

  • What led us to do this?
  • Tactic #1: WordNet
  • Tactic #2: Reconstructing Word Embeddings using Descendant Languages
  • Recap & Future Directions

Anton sends his regrets

Dr. Anton Vinogradov, (recent!) PhD, Computer Science

What led us to do this?

DERBi PIE

Database of Etymological Roots Beginning in PIE

DERBi PIE

  • Etymological database, with multiple, linked references
    • all of LIV parsed (thanks Thomas Olander!)
    • half of Pokorny fully parsed (will finish next July, provided funding)
    • will finish parsing NIL this December
    • ultimately: everything (?!?)
  • Have applied for NEH funding; hopefully this will bring us closer to the goal of a public release a year from now.

DERBi PIE: Query Searches

  • Search Functions Created (or we know how to)
    • Integrated Texts - identify roots, stems, and words in texts
    • Phonological Search - identify roots, stems, and words by phonological shape (regex)
    • Morphological Search - identify roots, stems, and words by morphological property (POS, class, gender, etc.)

DERBi PIE: Query Searches


  • Quickly realized that identifying semantic categories and relationships – especially ones that make sense across languages – is no easy task
    • Given cross-linguistic variation, there is no unified classification of “words”

DERBi PIE: What could this do for us?

  • In DERBi PIE, having an integrated semantic system could provide automated answers to questions such as:

    • Are certain sound sequences associated with certain meanings or semantic spheres?

    • Are certain morphological derivations associated with certain meanings or semantic spheres?

    • How have meanings changed over time into the various branches and daughter languages?

DERBi PIE: the Problem




  • #1: How to create a system of semantic properties and relationships that translates across IE languages?

  • #2: Is it even possible to do this?
    • (We still don’t know.)

Tactic #1: WordNet

WordNet: What is it?


  • A large, organized lexical database of English words
  • Groups words into “synsets” (sets of synonyms) based on meanings

WordNet: What is it?

  • Provides relationships between words, such as:
    • Synonyms (similar meanings, e.g. “dog”, “pooch”)
    • Antonyms (opposite meanings, e.g. “bad”, “good”)
    • Hypernyms (general terms, e.g., “animal” for “dog”)
    • Hyponyms (specific terms, e.g., “dog” for “animal”)
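The relations above can be sketched as a toy data structure; real queries would go through a library such as NLTK's WordNet interface, and the entries below are illustrative only:

```python
# Toy model of WordNet-style synsets and relations (illustrative only;
# the real WordNet is queried via libraries such as NLTK).
synsets = {
    "dog":    {"synonyms": {"dog", "pooch"}, "hypernyms": {"animal"}},
    "animal": {"synonyms": {"animal"},       "hypernyms": set()},
    "good":   {"synonyms": {"good"},         "antonyms": {"bad"}},
}

def hypernyms(word):
    """Return the more general terms linked to a word's synset."""
    return synsets[word].get("hypernyms", set())

def hyponyms(word):
    """Invert the hypernym relation: words whose hypernyms include `word`."""
    return {w for w, s in synsets.items() if word in s.get("hypernyms", set())}

print(hypernyms("dog"))    # {'animal'}
print(hyponyms("animal"))  # {'dog'}
```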

WordNet: What is it? (Kazakov & Dobnik 2003)



WordNet: Linked WordNets for Ancient Indo-European languages (Zanchi & Ginevra)

WordNet: Building One for PIE?

  • Working with a team of UK CS undergraduates, we mapped a list of PIE roots and words (primarily from Pokorny) onto the English WordNet structure
  • 1500 roots successfully mapped, though many are false matches (‘golf’?)

WordNet: Building One for PIE?

  • Numerous substantive difficulties with entries that do not map neatly:
    • *aghlu- (IEW, p. 8) ‘dark cloud; rainy weather’: new hypernym (of both ‘cloud’ and ‘weather’) needed
    • *ab- (IEW, p. 1) ‘water, river’: new hypernym needed? or two mappings?
  • The general idea of mapping PIE onto an English (or any other language’s) WordNet is difficult to implement, because PIE ≠ English (etc.)!

WordNet: Broader Problems

  • Limited Scope of Meanings: Doesn’t capture all nuances of word usage
  • Lack of Context: Doesn’t account for how context alters word meaning
  • Not All Languages Have WordNet: especially true for ancient/fragmentary languages
  • MUST BE DONE MANUALLY

Tactic #2: Reconstructing Word Embeddings using Descendant Languages

Word Embeddings: Overview

  • The second approach is more in line with what is done in present-day NLP: identifying semantic relationships through word embeddings.

  • To do so, we must:

    • Process a corpus for tokens and then lemmas;

    • Analyze the environments in which these lemmas occur;

    • Take this information to construct a semantic “hyperspace”
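A minimal sketch of the first two steps plus context collection, using a toy Latin sentence and a hand-made lemma table (a real pipeline would use a proper lemmatizer; the entries here are invented for illustration):

```python
# Toy version of the processing pipeline: tokens -> lemmas -> contexts.
text = "puella aquam portat"
tokens = text.split()                        # 1. process for tokens
lemma_table = {"puella": "puella", "aquam": "aqua", "portat": "porto"}
lemmas = [lemma_table[t] for t in tokens]    # 2. then lemmas

# 3. analyze the environments (context windows) each lemma occurs in
def contexts(seq, window=1):
    return {w: [x for j, x in enumerate(seq)
                if j != i and abs(i - j) <= window]
            for i, w in enumerate(seq)}

print(contexts(lemmas))
```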

Word Embeddings: Hyperspace

Plots like these can be created with generated vectors!

Word Embeddings: Hyperspace…?

  • If you were to plot out every vector generated from one of these models, you would have a hyperspace or a semantic space, with the dimensions of each word vector essentially acting as coordinates.

  • The closer two words lie to each other in this space, the closer in semantic value they are.

  • It’s possible to adjust how many dimensions you generate for each vector, but in general the more the better.

    • Fancier math (dimensionality reduction) allows us to ‘simplify’ these vectors down to 2 or 3 dimensions for easy viewing, as seen before.
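That dimensionality reduction can be sketched with a bare-bones PCA in NumPy (the vectors below are random stand-ins; in practice tools such as scikit-learn's PCA or t-SNE are used):

```python
# Reduce high-dimensional word vectors to 2-D for plotting via a
# minimal PCA: center the data, take an SVD, project onto the top
# two components. (Synthetic vectors; real ones come from a model.)
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 300))   # 50 "words", 300 dimensions

centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T        # project onto top 2 components

print(coords_2d.shape)  # (50, 2)
```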

Word Embeddings: Processing the Text for Tokens

Word Embeddings: Processing the Text for Lemmas

Word Embeddings: Identifying the Context

Word Embeddings: Constructing the Hyperspace

  • Run the tokenized and lemmatized text through a word embedding model such as word2vec or fastText to generate word vectors.
  • These vectors will be based on the words’ positions within the text.
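word2vec and fastText themselves come from libraries such as gensim; as a self-contained stand-in, here is a count-based sketch showing how vectors arise from the words' positions within the text (the toy corpus and the SVD factorization are illustrative, not the actual method):

```python
# Count-based stand-in for word2vec/fastText: build a co-occurrence
# matrix over a toy lemmatized corpus and factor it with SVD, so each
# word gets a low-dimensional vector derived from its contexts.
import numpy as np

corpus = [["puella", "aqua", "porto"],
          ["puella", "rosa", "amo"],
          ["nauta", "aqua", "timeo"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j, c in enumerate(sent):
            if i != j:
                counts[idx[w], idx[c]] += 1

u, s, _ = np.linalg.svd(counts)
vectors = u[:, :2] * s[:2]     # 2-dimensional word vectors
print(vectors[idx["puella"]])
```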

Word Embeddings: Constructing the Hyperspace

  • Another Latin example:

Word Embeddings: Calculate Word Similarities

  • To get a numerical representation of the similarity between two vectors, we use their cosine similarity (cosine of the angle between the vectors).
  • Similarity scores range from -1 to 1, with -1 indicating opposite vectors and 1 indicating vectors pointing in the same direction.

cos(θ) = (A · B) / (‖A‖ ‖B‖)
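Computed directly in NumPy, the cosine similarity described above looks like this:

```python
# Cosine similarity: cosine of the angle between two vectors,
# i.e. dot product divided by the product of the vector norms.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```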

Word Embeddings: Example of Cosine Similarity

Word Embeddings: Reconstructing a Hyperspace in Languages without a Corpus

  • We can construct a hyperspace for ancient languages;
  • But how does one do this for languages without any known corpus, such as PIE?
  • Well, how do we identify other properties of PIE?

Word Embeddings : Basic Idea


  • We take two similar properties in two (or more) related languages, allowing us to approximate an earlier state in a source language

Word Embeddings : Basic Idea

  • In this way, we propose using word embedding models built from descendant languages to approximate an earlier state of the source language

Word Embeddings: Methods

  • As you can imagine, this stuff is complicated, which is why we won’t go into much detail about the specific methods;
    • see GitHub for four-page paper, code, data, etc.
  • If there are any questions that we can’t answer, we’ll forward them to Anton, who will be happy to do so

Word Embeddings, Problem #1: Vectors Across Models

  • Vectors generated for hyperspaces take on arbitrary values when training models
    • So ‘dog’ could = (0, 0) or (-6, 100) – these values change every time you run the model
  • For this reason, we must align models (Dev et al., 2021):
    • Identify substructures across language models that remain fixed
    • Use pre-aligned word embedding models (following Joulin et al., 2018)
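One standard way to exploit fixed substructures across models is orthogonal Procrustes alignment, sketched below with synthetic vectors (a common alignment technique, not necessarily the exact method of Dev et al. or Joulin et al.):

```python
# Orthogonal Procrustes alignment: find the rotation W that maps one
# embedding space onto another over a shared set of anchor words.
# (Synthetic data: the "target" space is secretly a rotated copy.)
import numpy as np

rng = np.random.default_rng(42)
src = rng.normal(size=(100, 50))        # anchor-word vectors, model A

q, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # hidden rotation
tgt = src @ q                            # anchor-word vectors, model B

# solve min ||src @ W - tgt|| over orthogonal W via SVD
u, _, vt = np.linalg.svd(src.T @ tgt)
w = u @ vt

print(np.allclose(src @ w, tgt))  # True: the rotation is recovered
```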

Word Embeddings, Problem #2: Verification


  • So we take pre-aligned hyperspaces, and reconstruct an earlier, source hyperspace based on the hyperspaces provided
  • But how can we trust this methodology?
    • Obviously can’t verify the hyperspace of PIE through analysis of PIE texts!

Word Embeddings, Problem #2: Verification

Word Embeddings: Methods

  • We use existing aligned models of Spanish & French (Joulin et al., 2018) as a source of vector and word information
  • Words filtered out:
    • If there is no corresponding word in the other languages
    • Non-vocabulary, including words with non-language characters

Word Embeddings: Methods


  • Models were trained on French & Spanish Wikipedia articles, which include both real vocabulary and non-words/words containing non-language characters; the latter were removed (7.3% and 6.6%, respectively)

  • Remaining words were lemmatized, further reducing the vocabulary by roughly 10%

Word Embeddings: Methods


  • To relate words together and find common words:
    • Words are translated into each other’s respective languages using Google Translate, via the Python translation library deep-translator
    • The same is done with Latin, using the Latin corpus (from CLTK [the Tesserae Project]), which is lemmatized
  • Any word that cannot be lemmatized in Latin (such as Greek words) is removed from the corpus

Word Embeddings: Calculating *Latin

  1. If there isn’t a 1:1 correspondence between the Romance language & Latin, identify the centroid of the lemma’s vectors: the language-word center (lwc)
  2. Identify the centroid of both lwcs -> the inter-language-word center (ilwc)
  3. Identify the closest vectors to the ilwc using cosine-distance
  4. Take the average of these two vectors to arrive at the approximate *Latin word vector

(Steps 3 & 4 done to hedge for translation errors and to avoid outliers)
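A toy sketch of steps 1–4 with made-up 2-D vectors (real vectors are high-dimensional embeddings; all values below are invented for illustration):

```python
# Steps 1-4 of the *Latin reconstruction, on invented 2-D vectors.
import numpy as np

def centroid(vecs):
    return np.mean(vecs, axis=0)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# step 1: per-language centroid over a lemma's vectors (lwc)
fr_lwc = centroid([np.array([1.0, 0.0]), np.array([0.8, 0.2])])
es_lwc = centroid([np.array([0.9, 0.1])])

# step 2: centroid of the lwcs -> inter-language-word center (ilwc)
ilwc = centroid([fr_lwc, es_lwc])

# steps 3-4: take the vectors closest to the ilwc by cosine
# similarity and average them into the approximate *Latin vector
candidates = [fr_lwc, es_lwc]
closest = sorted(candidates, key=lambda v: -cos(v, ilwc))[:2]
latin_star = centroid(closest)
print(latin_star)
```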

Word Embeddings: Results

  • To evaluate the effectiveness of the method, recall that we want to compare the *Latin with the actual Latin
  • This is our “normal” model

Word Embeddings: Analogy vs. OddOneOut

  • The analogy task (Mikolov et al. 2013) is considered standard when evaluating word embedding models: London is to England as Paris is to France
  • But it doesn’t work for languages with small corpora (LRLs), especially ones that aren’t modern
  • We follow Stringham & Izbicki 2020 in using the OddOneOut task, which has been demonstrated to be more accurate in these situations
  • OddOneOut task demonstrated to work for corpora as small as 1800 tokens (Old Gujarati)
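The OddOneOut task can be sketched as follows: given a set of words, the model should pick as "odd" the one least similar, on average, to the others (toy 2-D vectors and invented Latin items; real evaluations use trained embeddings):

```python
# Toy OddOneOut: the outlier is the word with the lowest average
# cosine similarity to the rest of the set.
import numpy as np

vectors = {
    "canis": np.array([1.0, 0.1]),    # animal words cluster together...
    "equus": np.array([0.9, 0.2]),
    "lupus": np.array([1.0, 0.3]),
    "mensa": np.array([-0.2, 1.0]),   # ...the outlier does not
}

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def odd_one_out(words):
    def avg_sim(w):
        return np.mean([cos(vectors[w], vectors[o]) for o in words if o != w])
    return min(words, key=avg_sim)

print(odd_one_out(["canis", "equus", "lupus", "mensa"]))  # mensa
```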

Word Embeddings: Results

  • Our results indicate mixed success: we can create a hyperspace from the descendant languages that performs reasonably well on Latin tests with OddOneOut
  • Preliminary tests show that the smaller the corpus, the better the model works, as compared to the normal one
  • This is not exactly a bad thing, as this is the case for many languages in our discipline!
    • If a language has a large corpus (such as Ancient Greek), we can reduce it as we’ve done here for Latin

Word Embeddings: Results


  • Unclear what the “magic number” is for corpus size to arrive at the most accurate representation (compared to the normal model)
  • Unclear exactly why performance of descendant models tends to decrease as the corpus size increases
    • Anton has ideas, all quite technical – see the draft for further discussion

Word Embeddings: Problems with Current Model

  1. Use of Google Translate is suboptimal and may result in translation errors; should use either bilingual dictionaries or LLMs (think GPT) for more accurate translations
  2. LLMs > Word Embeddings
    1. Vectors: polysemy (e.g., ‘bank’)
    2. Vectors: context
    3. Vectors: precision (300 vs. 175B parameters)

Recap & Future Directions

Recap: WordNet

  1. Upsides: semi-universal structure
  2. Downsides:
    • must be done manually; requires scholars to make choices that are sometimes unknowable;
    • isn’t capable of showing certain types of semantic similarities/differences beyond synonymy, hyponymy, etc.

Recap: Word Embeddings


  1. Upsides: fully automated, low computational cost
  2. Downsides: doesn’t distinguish multiple senses (polysemy), is less accurate than LLMs

Future Directions: WordNet Embedding?

  • Johansson & Nieto Piña 2015: build hyperspaces from systems like WordNet

    • We should be able to do this for many IE languages (mostly modern)

    • For languages without WordNets (like PIE), we “translate” the lexicon (< DERBi PIE) into a WordNet structure

Future Directions: WordNet Embedding?

  • Eliminate any matchings that are untrue (like *i̯eh₂- “drive” = “hit a golfball”)

    • Manually assign outliers as hyponyms, hypernyms, synonyms, etc. of existing lexemes
  • But the same problems as before remain – less precise, must be done manually

Future Directions: using LLM models

  • Probably the best course of action is to stick with the Descendant Model strategy, but:

    • Utilize LLMs (such as GPT) for modelling (hyperspace) for greater precision and differentiation of polysemy

    • Instead of Google Translate, use bilingual dictionary or LLM

      • We’ve had great success with LLMs in the parsing of data for DERBi PIE

Future Directions: using LLM models

  • Add additional Romance languages;

  • When happy with results, move on to other subbranches (likely Slavic or Indic)

  • Iterate, iterate, iterate!

Thank you!

Download: Slides, Paper, Code